The NYC Rat Sightings dataset describes mostly the location where there are rat sightings, the dates when it happened and which type of building it happened on. In the original dataset there are also many empty columns or columns with only one value, which just means the data was organised poorly, and we will not use these.
We will analyse if sightings depend on the month they happen. Additionally, we will analyse a dataset that describes borough populations so we can analyse if population density compares to number of rat sightings and which boroughs have more density of rat sightings.
The dataset we will be using can be found on kaggle.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggcharts) # for bar_chart() function
library(pander) # for pander() function, make the descriptive stat have pretty output
library(psych) # for psych() function, for descriptive stat
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
We will import an extension of the ggplot2 library, the
ggmap library. If you don’t have the ggmap
package installed, please do so by running
install.packages("ggmap") on your R console.
library("ggmap")
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
The dataset initially looks like this:
rat_dataset <-
read_csv("Rat_Sightings.csv", show_col_types = FALSE)
head(rat_dataset)
## # A tibble: 6 × 52
## Uniqu…¹ Creat…² Close…³ Agency Agenc…⁴ Compl…⁵ Descr…⁶ Locat…⁷ Incid…⁸ Incid…⁹
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 3.15e7 09/04/… 09/18/… DOHMH Depart… Rodent Rat Si… 3+ Fam… 10006 <NA>
## 2 3.15e7 09/04/… 10/28/… DOHMH Depart… Rodent Rat Si… Commer… 10306 2270 H…
## 3 3.15e7 09/04/… <NA> DOHMH Depart… Rodent Rat Si… 1-2 Fa… 10310 758 PO…
## 4 3.15e7 09/04/… 09/14/… DOHMH Depart… Rodent Rat Si… 3+ Fam… 11206 198 SC…
## 5 3.15e7 09/04/… 09/22/… DOHMH Depart… Rodent Rat Si… 3+ Fam… 10462 2138 W…
## 6 3.15e7 09/04/… 09/22/… DOHMH Depart… Rodent Rat Si… 3+ Fam… 11231 179 LU…
## # … with 42 more variables: `Street Name` <chr>, `Cross Street 1` <chr>,
## # `Cross Street 2` <chr>, `Intersection Street 1` <chr>,
## # `Intersection Street 2` <chr>, `Address Type` <chr>, City <chr>,
## # Landmark <lgl>, `Facility Type` <chr>, Status <chr>, `Due Date` <chr>,
## # `Resolution Action Updated Date` <chr>, `Community Board` <chr>,
## # Borough <chr>, `X Coordinate (State Plane)` <dbl>,
## # `Y Coordinate (State Plane)` <dbl>, `Park Facility Name` <chr>, …
In order to start cleaning up the dataset, a few custom functions need to be written in order to split out information from several columns.
First off, we have written a function to split a date into its constituent components. The function, as seen below, takes a date string and a specified x value. It formats the input date as an ISO8601 date-time and subsequently isolates each component. Based on the provided x-value, the function puts out the required component.
.DateSplit <- function(txtdate, x){
date <- parse_date(txtdate, format = "%m/%d/%Y %I:%M:%S %p")
time <- parse_time(txtdate, format = "%m/%d/%Y %I:%M:%S %p")
year <- as.factor(format(date, format = "%Y"))
month <- as.factor(format(date, format = "%m"))
day <- as.numeric(format(date, format = "%d"))
if (x == 1) {value <- year}
else if (x == 2) {value <- month}
else if (x == 3) {value <- day}
else if (x == 4) {value <- time}
return(value)
}
The second function can be used to isolate a specific number from a data point (in our case, this is either a house number of a community board number). The function only takes an address string. It first filters out NA values, after which it extracts and returns numerals from the input string.
.GetNumber <- function(address){
if(!identical(address, NA)){
nums <- str_extract(address, "[0-9]+")
return(nums)
}
}
With the necessary tools prepared, the dataset can be cleaned and prepared for further analysis. Empty columns and columns with only a single value represented are filtered, and the columns that are left are renamed to remove spaces. Next we split dates with the aforementioned .DateSplit() function, and use the .GetNumber() function to isolate relevant numeric values from strings. Following the use of these functions, the now redundant columns are removed.
rat_dataset <-
read_csv("Rat_Sightings.csv", show_col_types = FALSE) %>%
rename(Unique_Key = "Unique Key", Created_Date = "Created Date", Closed_Date = "Closed Date", Location_Type = "Location Type", Incident_Zip = "Incident Zip", Incident_Address = "Incident Address", Street_Name = "Street Name", Cross_Street_1 = "Cross Street 1", Cross_Street_2 = "Cross Street 2", Intersection_Street_1 = "Intersection Street 1", Intersection_Street_2 = "Intersection Street 2", Address_Type = "Address Type", Status = "Status", Due_Date = "Due Date", RAU_Date = "Resolution Action Updated Date", Community_Board = "Community Board", Borough = "Borough", Latitude = "Latitude", Longitude = "Longitude") %>%
mutate(Created_Year = .DateSplit(Created_Date, 1), Created_Month = .DateSplit(Created_Date, 2), Created_Day = .DateSplit(Created_Date, 3)) %>%
mutate(Closed_Year = .DateSplit(Closed_Date, 1), Closed_Month = .DateSplit(Closed_Date, 2), Closed_Day = .DateSplit(Closed_Date, 3)) %>%
mutate(Due_Year = .DateSplit(Due_Date, 1), Due_Month = .DateSplit(Due_Date, 2), Due_Day = .DateSplit(Due_Date, 3), Due_Time = .DateSplit(Due_Date, 4)) %>%
mutate(RAU_Year = .DateSplit(RAU_Date, 1), RAU_Month = .DateSplit(RAU_Date, 2), RAU_Day = .DateSplit(RAU_Date, 3), RAU_Time = .DateSplit(RAU_Date, 4)) %>%
mutate(Address_Number = .GetNumber(Incident_Address)) %>%
mutate(Community_Board_Number = .GetNumber(Community_Board)) %>%
select(Unique_Key, Created_Year, Created_Month, Created_Day, Closed_Year, Closed_Month, Closed_Day, Location_Type, Incident_Zip, Address_Number, Street_Name, Cross_Street_1, Cross_Street_2, Intersection_Street_1, Intersection_Street_2, Address_Type, Status, Due_Year, Due_Month, Due_Day, Due_Time, RAU_Year, RAU_Month, RAU_Day, RAU_Time, Community_Board_Number, Borough, Latitude, Longitude)
We also want to make sure that all of the features have the proper value type.
rat_dataset$Location_Type <-
as.factor(rat_dataset$Location_Type)
rat_dataset$Street_Name <-
as.factor(rat_dataset$Street_Name)
rat_dataset$Cross_Street_1 <-
as.factor(rat_dataset$Cross_Street_1)
rat_dataset$Cross_Street_2 <-
as.factor(rat_dataset$Cross_Street_2)
rat_dataset$Address_Type <-
as.factor(rat_dataset$Address_Type)
rat_dataset$Status <-
as.factor(rat_dataset$Status)
rat_dataset$Borough <-
as.factor(rat_dataset$Borough)
So, our dataset after cleaning is as follows:
head(rat_dataset)
## # A tibble: 6 × 29
## Unique_Key Created_Y…¹ Creat…² Creat…³ Close…⁴ Close…⁵ Close…⁶ Locat…⁷ Incid…⁸
## <dbl> <fct> <fct> <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 31464015 2015 09 4 2015 09 18 3+ Fam… 10006
## 2 31464024 2015 09 4 2015 10 28 Commer… 10306
## 3 31464025 2015 09 4 <NA> <NA> NA 1-2 Fa… 10310
## 4 31464026 2015 09 4 2015 09 14 3+ Fam… 11206
## 5 31464027 2015 09 4 2015 09 22 3+ Fam… 10462
## 6 31464188 2015 09 4 2015 09 22 3+ Fam… 11231
## # … with 20 more variables: Address_Number <chr>, Street_Name <fct>,
## # Cross_Street_1 <fct>, Cross_Street_2 <fct>, Intersection_Street_1 <chr>,
## # Intersection_Street_2 <chr>, Address_Type <fct>, Status <fct>,
## # Due_Year <fct>, Due_Month <fct>, Due_Day <dbl>, Due_Time <time>,
## # RAU_Year <fct>, RAU_Month <fct>, RAU_Day <dbl>, RAU_Time <time>,
## # Community_Board_Number <chr>, Borough <fct>, Latitude <dbl>,
## # Longitude <dbl>, and abbreviated variable names ¹Created_Year, …
pander(describe(rat_dataset), caption="Descriptive statistics")
| vars | n | mean | sd | median | |
|---|---|---|---|---|---|
| Unique_Key | 1 | 101914 | 28158637 | 6015376 | 28836804 |
| Created_Year* | 2 | 101914 | 4.877 | 2.277 | 5 |
| Created_Month* | 3 | 101914 | 6.475 | 2.997 | 7 |
| Created_Day | 4 | 101914 | 15.69 | 8.719 | 16 |
| Closed_Year* | 5 | 90983 | 7.986 | 2.161 | 8 |
| Closed_Month* | 6 | 90983 | 6.67 | 3.087 | 7 |
| Closed_Day | 7 | 90983 | 15.86 | 8.687 | 16 |
| Location_Type* | 8 | 101908 | 5.404 | 5.046 | 3 |
| Incident_Zip | 9 | 101578 | 10729 | 631.2 | 10472 |
| Address_Number* | 10 | 89995 | 2091 | 1369 | 2056 |
| Street_Name* | 11 | 92839 | 3503 | 1896 | 3496 |
| Cross_Street_1* | 12 | 85258 | 2118 | 1374 | 1846 |
| Cross_Street_2* | 13 | 85225 | 2187 | 1410 | 1975 |
| Intersection_Street_1* | 14 | 8925 | 1123 | 716.2 | 1131 |
| Intersection_Street_2* | 15 | 8925 | 1226 | 706.7 | 1211 |
| Address_Type* | 16 | 101568 | 1.521 | 1.017 | 1 |
| Status* | 17 | 101914 | 2.639 | 1.3 | 2 |
| Due_Year* | 18 | 101797 | 6.924 | 2.279 | 7 |
| Due_Month* | 19 | 101797 | 6.901 | 3.026 | 7 |
| Due_Day | 20 | 101797 | 15.67 | 8.734 | 16 |
| Due_Time | 21 | 101797 | NA | NA | NA |
| RAU_Year* | 22 | 101911 | 5.899 | 2.263 | 6 |
| RAU_Month* | 23 | 101911 | 6.656 | 3.032 | 7 |
| RAU_Day | 24 | 101911 | 15.82 | 8.707 | 16 |
| RAU_Time | 25 | 101911 | NA | NA | NA |
| Community_Board_Number* | 26 | 78783 | 7.903 | 4.296 | 7 |
| Borough* | 27 | 101914 | 2.495 | 1.112 | 2 |
| Latitude | 28 | 101208 | 40.74 | 0.08222 | 40.73 |
| Longitude | 29 | 101208 | -73.93 | 0.07062 | -73.94 |
| trimmed | mad | min | max | |
|---|---|---|---|---|
| Unique_Key | 28434708 | 7389234 | 11464394 | 37197000 |
| Created_Year* | 4.971 | 2.965 | 1 | 8 |
| Created_Month* | 6.48 | 2.965 | 1 | 12 |
| Created_Day | 15.68 | 10.38 | 1 | 31 |
| Closed_Year* | 8.056 | 2.965 | 1 | 11 |
| Closed_Month* | 6.707 | 2.965 | 1 | 12 |
| Closed_Day | 15.87 | 10.38 | 1 | 31 |
| Location_Type* | 4.563 | 2.965 | 1 | 20 |
| Incident_Zip | 10734 | 696.8 | 83 | 100354 |
| Address_Number* | 2082 | 1926 | 1 | 4341 |
| Street_Name* | 3523 | 2371 | 1 | 6744 |
| Cross_Street_1* | 2068 | 1680 | 1 | 4645 |
| Cross_Street_2* | 2149 | 1773 | 1 | 4730 |
| Intersection_Street_1* | 1104 | 947.4 | 1 | 2436 |
| Intersection_Street_2* | 1224 | 889.6 | 1 | 2434 |
| Address_Type* | 1.276 | 0 | 1 | 5 |
| Status* | 2.48 | 0 | 1 | 5 |
| Due_Year* | 7.029 | 2.965 | 1 | 10 |
| Due_Month* | 6.991 | 2.965 | 1 | 12 |
| Due_Day | 15.66 | 10.38 | 1 | 31 |
| Due_Time | NA | NA | Inf | -Inf |
| RAU_Year* | 5.992 | 2.965 | 1 | 9 |
| RAU_Month* | 6.691 | 2.965 | 1 | 12 |
| RAU_Day | 15.83 | 10.38 | 1 | 31 |
| RAU_Time | NA | NA | Inf | -Inf |
| Community_Board_Number* | 7.608 | 4.448 | 1 | 22 |
| Borough* | 2.434 | 1.483 | 1 | 6 |
| Latitude | 40.74 | 0.08913 | 40.5 | 40.91 |
| Longitude | -73.93 | 0.05031 | -74.25 | -73.7 |
| range | skew | kurtosis | se | |
|---|---|---|---|---|
| Unique_Key | 25732606 | -0.3106 | -1.038 | 18843 |
| Created_Year* | 7 | -0.2617 | -1.182 | 0.007133 |
| Created_Month* | 11 | -0.03346 | -0.8644 | 0.009388 |
| Created_Day | 30 | 0.01144 | -1.171 | 0.02731 |
| Closed_Year* | 10 | -0.2367 | -1.133 | 0.007164 |
| Closed_Month* | 11 | -0.09799 | -0.9286 | 0.01023 |
| Closed_Day | 30 | -0.007812 | -1.158 | 0.0288 |
| Location_Type* | 19 | 1.481 | 1.232 | 0.01581 |
| Incident_Zip | 100271 | 27.09 | 4017 | 1.98 |
| Address_Number* | 4340 | 0.03505 | -1.413 | 4.565 |
| Street_Name* | 6743 | -0.0644 | -1.129 | 6.223 |
| Cross_Street_1* | 4644 | 0.2506 | -1.205 | 4.707 |
| Cross_Street_2* | 4729 | 0.1936 | -1.225 | 4.831 |
| Intersection_Street_1* | 2435 | 0.1354 | -1.234 | 7.581 |
| Intersection_Street_2* | 2433 | 0.03744 | -1.201 | 7.48 |
| Address_Type* | 4 | 1.648 | 1.065 | 0.00319 |
| Status* | 4 | 1.189 | -0.3613 | 0.004073 |
| Due_Year* | 9 | -0.2765 | -1.171 | 0.007144 |
| Due_Month* | 11 | -0.203 | -0.872 | 0.009484 |
| Due_Day | 30 | 0.01548 | -1.173 | 0.02737 |
| Due_Time | -Inf | NA | NA | NA |
| RAU_Year* | 8 | -0.2614 | -1.182 | 0.007089 |
| RAU_Month* | 11 | -0.09621 | -0.8734 | 0.009499 |
| RAU_Day | 30 | -0.0009365 | -1.166 | 0.02727 |
| RAU_Time | -Inf | NA | NA | NA |
| Community_Board_Number* | 21 | 0.5372 | -0.4965 | 0.01531 |
| Borough* | 5 | 0.4123 | -0.5813 | 0.003484 |
| Latitude | 0.4134 | -0.02161 | -0.8435 | 0.0002585 |
| Longitude | 0.5528 | -0.5411 | 2.842 | 0.000222 |
Where are the sightings more concentrated?
Firstly, we take a look at how the sightings are distributed:
ggplot(rat_dataset, mapping = aes(y = Longitude)) +
geom_boxplot() +
theme_minimal()
ggplot(rat_dataset, mapping = aes(y = Latitude)) +
geom_boxplot() +
theme_minimal()
There is a concentration in the middle of the range, with many of
what geom_boxplot() considers outliers. So the part of New
York we will be looking at is:
data4lat <-
rat_dataset %>%
select(Latitude) %>%
filter(!is.na(Latitude))
max_lat <- max(data4lat)
min_lat <- min(data4lat)
data4lon <-
rat_dataset %>%
select(Longitude) %>%
filter(!is.na(Longitude))
max_lon <- max(data4lon)
min_lon <- min(data4lon)
newyork.map <- get_map(location= 'New York')
## Source : https://maps.googleapis.com/maps/api/staticmap?center=New%20York&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx-0bVFQTH0uA
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=New+York&key=xxx-0bVFQTH0uA
sightings_map <-
ggmap(newyork.map) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("New York Map")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
sightings_map
From the distribution on the boxplots, we can assume that most of the sightings are concentrated in the middle of the map, which we can confirm by plotting the sightings. There is also a smaller concentration at the top middle of the map.
sightings_map <-
ggmap(newyork.map) +
geom_point(data = rat_dataset, mapping = aes(x = Longitude, y = Latitude), alpha = 0.1) +
stat_density2d(data = rat_dataset, mapping = aes(x = Longitude, y = Latitude)) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Concentration of Rat Sightings in New York")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
sightings_map
It seems like the sightings are mostly concentrated in 3 boroughs: Manhattan, Bronx and Brooklyn, with Staten Island having the lowest concentration by far.
Do the sighting locations change over the years?
years_map <-
ggmap(newyork.map) +
geom_point(data = rat_dataset, mapping = aes(x = Longitude, y = Latitude, color = Created_Year), alpha = 0.3) +
facet_wrap(~ Created_Year) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Concentration of Rat Sightings over the Years")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
years_map
It definitely looks like that is not the case, as the maps all look basically exactly the same, meaning the actual locations of the sightings are constant, which may indicate to no interventions ever having been done in any of the places.
Next up, we wanted to find out whether there are relative differences between boroughs when it comes to rat sightings, or if they are the same, when we account for differences in population and area respectively.
Summarize Rat Sightings per borough per year from the original dataset
We started of by creating a new dataset centered around the boroughs.
From our main dataset (rat_dataset), we took the borough
column, and subsequently counted how often each borough was represented
per year.
Side note: 2008 and 2009 have been ignored because they only have 18 and 1 rat sightings respectively.
Borough_Rats <-
rat_dataset %>%
group_by(Borough, Due_Year) %>%
tally(sort = TRUE) %>%
pivot_wider(names_from = Due_Year, values_from = n) %>%
select(Borough, "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017") %>%
drop_na()
head(Borough_Rats)
## # A tibble: 5 × 9
## # Groups: Borough [5]
## Borough `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017`
## <fct> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 BROOKLYN 3237 3631 3574 3359 3887 5136 5999 5795
## 2 MANHATTAN 2734 2410 2812 3053 3467 4063 4648 3563
## 3 BRONX 1938 2145 2145 2110 2721 3158 3455 3026
## 4 QUEENS 1511 1582 1512 1687 1713 2065 2409 2316
## 5 STATEN ISLAND 619 550 530 649 603 590 732 643
Creating a dataset for borough population
In order to compare borough by population, we first need a dataset which contains population (over time). The dataset we will be using can be found on kaggle.
It should be noted that this dataset only contains census data, which only happens every 10 years. Therefore, we manually added population data for the remaining years (which can be found here: https://fred.stlouisfed.org/categories/29191)
Lastly, we also manually added area data from Wikipedia.
Borough_Population <-
read_csv("New_York_City_Population_by_Borough__1950_-_2040.csv", show_col_types = FALSE) %>%
select("Borough", "Area", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017") %>%
mutate(Borough = toupper(Borough)) %>%
inner_join(Borough_Rats, by = "Borough") %>%
rename(Borough = "Borough", Area = "Area", Population_2010 = "2010.x", Rat_Sightings_2010 = "2010.y", Population_2011 = "2011.x", Rat_Sightings_2011 = "2011.y", Population_2012 = "2012.x", Rat_Sightings_2012 = "2012.y", Population_2013 = "2013.x", Rat_Sightings_2013 = "2013.y", Population_2014 = "2014.x", Rat_Sightings_2014 = "2014.y", Population_2015 = "2015.x", Rat_Sightings_2015 = "2015.y", Population_2016 = "2016.x", Rat_Sightings_2016 = "2016.y", Population_2017 = "2017.x", Rat_Sightings_2017 = "2017.y") %>%
relocate(Borough, Area, Population_2010, Rat_Sightings_2010, Population_2011, Rat_Sightings_2011, Population_2012, Rat_Sightings_2012, Population_2013, Rat_Sightings_2013, Population_2014, Rat_Sightings_2014, Population_2015, Rat_Sightings_2015, Population_2016, Rat_Sightings_2016, Population_2017, Rat_Sightings_2017)
head(Borough_Population)
## # A tibble: 5 × 18
## Borough Area Popul…¹ Rat_S…² Popul…³ Rat_S…⁴ Popul…⁵ Rat_S…⁶ Popul…⁷ Rat_S…⁸
## <chr> <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
## 1 BRONX 110 1385108 1938 1396954 2145 1411087 2145 1421498 2110
## 2 BROOKLYN 183. 2552911 3237 2540918 3631 2568538 3574 2587759 3359
## 3 MANHATT… 59.1 1585873 2734 1608717 2410 1624573 2812 1628379 3053
## 4 QUEENS 280 2250002 1511 2255261 1582 2271920 1512 2286788 1687
## 5 STATEN … 152 468730 619 471014 550 470597 530 471783 649
## # … with 8 more variables: Population_2014 <dbl>, Rat_Sightings_2014 <int>,
## # Population_2015 <dbl>, Rat_Sightings_2015 <int>, Population_2016 <dbl>,
## # Rat_Sightings_2016 <int>, Population_2017 <dbl>, Rat_Sightings_2017 <int>,
## # and abbreviated variable names ¹Population_2010, ²Rat_Sightings_2010,
## # ³Population_2011, ⁴Rat_Sightings_2011, ⁵Population_2012,
## # ⁶Rat_Sightings_2012, ⁷Population_2013, ⁸Rat_Sightings_2013
Creating a database for rats per population
We are interested in comparing boroughs, so we need to generate relative values to compare them on. In order to do this, we calculate the rate of rat sightings per 100.000 citizens.
Rats_Per_Pop <-
Borough_Population %>%
mutate('2010' = round(((Rat_Sightings_2010 / Population_2010) * 100000), digits = 2), '2011' = round(((Rat_Sightings_2011 / Population_2011) * 100000), digits = 2), '2012' = round(((Rat_Sightings_2012 / Population_2012) * 100000), digits = 2), '2013' = round(((Rat_Sightings_2013 / Population_2013) * 100000), digits = 2), '2014' = round(((Rat_Sightings_2014 / Population_2014) * 100000), digits = 2), '2015' = round(((Rat_Sightings_2015 / Population_2015) * 100000), digits = 2), '2016' = round(((Rat_Sightings_2016 / Population_2016) * 100000), digits = 2), '2017' = round(((Rat_Sightings_2017 / Population_2017) * 100000), digits = 2)) %>%
select(Borough, '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017')
head(Rats_Per_Pop)
## # A tibble: 5 × 9
## Borough `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 140. 154. 152. 148. 190. 219. 239. 210.
## 2 BROOKLYN 127. 143. 139. 130. 149. 197. 230. 223.
## 3 MANHATTAN 172. 150. 173. 187. 212. 248. 284. 0
## 4 QUEENS 67.2 70.2 66.6 73.8 74.5 89.6 104. 101.
## 5 STATEN ISLAND 132. 117. 113. 138. 128. 125. 154. 135.
Creating a database for rats per area
For area, we followed the same process, but for rat sightings per kilometre squared.
Rats_Per_Km <-
Borough_Population %>%
mutate('2010' = round((Rat_Sightings_2010 / Area), digits = 2), '2011' = round((Rat_Sightings_2011 / Area), digits = 2), '2012' = round((Rat_Sightings_2012 / Area), digits = 2), '2013' = round((Rat_Sightings_2013 / Area), digits = 2), '2014' = round((Rat_Sightings_2014 / Area), digits = 2), '2015' = round((Rat_Sightings_2015 / Area), digits = 2), '2016' = round((Rat_Sightings_2016 / Area), digits = 2), '2017' = round((Rat_Sightings_2017 / Area), digits = 2)) %>%
select(Borough, '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017')
head(Rats_Per_Km)
## # A tibble: 5 × 9
## Borough `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 17.6 19.5 19.5 19.2 24.7 28.7 31.4 27.5
## 2 BROOKLYN 17.6 19.8 19.5 18.3 21.2 28 32.7 31.6
## 3 MANHATTAN 46.3 40.8 47.6 51.7 58.7 68.8 78.6 60.3
## 4 QUEENS 5.4 5.65 5.4 6.03 6.12 7.38 8.6 8.27
## 5 STATEN ISLAND 4.07 3.62 3.49 4.27 3.97 3.88 4.82 4.23
Visualization of borough comparisons
In order to create the necessary visualizations, we slightly need to restructure our datasets.
Rat/population rates
P_Rates <-
Rats_Per_Pop %>%
pivot_longer(c('2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017'), names_to = "Year", values_to = "Rate")
as.factor(P_Rates$Year)
## [1] 2010 2011 2012 2013 2014 2015 2016 2017 2010 2011 2012 2013 2014 2015 2016
## [16] 2017 2010 2011 2012 2013 2014 2015 2016 2017 2010 2011 2012 2013 2014 2015
## [31] 2016 2017 2010 2011 2012 2013 2014 2015 2016 2017
## Levels: 2010 2011 2012 2013 2014 2015 2016 2017
head(P_Rates, n = 20)
## # A tibble: 20 × 3
## Borough Year Rate
## <chr> <chr> <dbl>
## 1 BRONX 2010 140.
## 2 BRONX 2011 154.
## 3 BRONX 2012 152.
## 4 BRONX 2013 148.
## 5 BRONX 2014 190.
## 6 BRONX 2015 219.
## 7 BRONX 2016 239.
## 8 BRONX 2017 210.
## 9 BROOKLYN 2010 127.
## 10 BROOKLYN 2011 143.
## 11 BROOKLYN 2012 139.
## 12 BROOKLYN 2013 130.
## 13 BROOKLYN 2014 149.
## 14 BROOKLYN 2015 197.
## 15 BROOKLYN 2016 230.
## 16 BROOKLYN 2017 223.
## 17 MANHATTAN 2010 172.
## 18 MANHATTAN 2011 150.
## 19 MANHATTAN 2012 173.
## 20 MANHATTAN 2013 187.
Population rate figures
P_Rates %>%
ggplot(aes(x=Year, y=Rate, color = Borough, group = Borough)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Sightings per 100.000 population", title = "Rate of sightings by Borough population", color = "Borough") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
P_Rates %>%
ggplot(aes(x=Year, y=Rate, group = Borough)) +
geom_point() +
geom_line() +
facet_wrap(~ Borough) +
labs(x = "Year", y = "Sightings per 100.000 population", title = "Rate of sightings by Borough population") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
Rat/Area rates
A_Rates <-
Rats_Per_Km %>%
pivot_longer(c('2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017'), names_to = "Year", values_to = "Rate")
as.factor(A_Rates$Year)
## [1] 2010 2011 2012 2013 2014 2015 2016 2017 2010 2011 2012 2013 2014 2015 2016
## [16] 2017 2010 2011 2012 2013 2014 2015 2016 2017 2010 2011 2012 2013 2014 2015
## [31] 2016 2017 2010 2011 2012 2013 2014 2015 2016 2017
## Levels: 2010 2011 2012 2013 2014 2015 2016 2017
head(A_Rates, n = 20)
## # A tibble: 20 × 3
## Borough Year Rate
## <chr> <chr> <dbl>
## 1 BRONX 2010 17.6
## 2 BRONX 2011 19.5
## 3 BRONX 2012 19.5
## 4 BRONX 2013 19.2
## 5 BRONX 2014 24.7
## 6 BRONX 2015 28.7
## 7 BRONX 2016 31.4
## 8 BRONX 2017 27.5
## 9 BROOKLYN 2010 17.6
## 10 BROOKLYN 2011 19.8
## 11 BROOKLYN 2012 19.5
## 12 BROOKLYN 2013 18.3
## 13 BROOKLYN 2014 21.2
## 14 BROOKLYN 2015 28
## 15 BROOKLYN 2016 32.7
## 16 BROOKLYN 2017 31.6
## 17 MANHATTAN 2010 46.3
## 18 MANHATTAN 2011 40.8
## 19 MANHATTAN 2012 47.6
## 20 MANHATTAN 2013 51.7
Area rate figures
A_Rates %>%
ggplot(aes(x=Year, y=Rate, color = Borough, group = Borough)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Sightings per square kilometre", title = "Rate of sightings by Borough area", color = "Borough") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
A_Rates %>%
ggplot(aes(x=Year, y=Rate, group = Borough)) +
geom_point() +
geom_line() +
facet_wrap(~ Borough) +
labs(x = "Year", y = "Sightings per square kilometre", title = "Rate of sightings by Borough area", color = "Borough") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
We can see a rather obvious trend, where Manhattan has relatively the most rat sightings by far, both when compared on population as well as area. Queens and Staten Island both keep to the lower end of spectrum.
When it comes to change over the years, we see that the Bronx, Brooklyn, and Manhattan increase at a similar rate when it comes to population rate, while Queens and Staten Island remain relatively stable. When we compare on the basis of area, we see that Manhattan seems to grow faster than the Bronx and Brooklyn, while Queens and Staten Island still remain stable.
We can look at the numbers and see which Borough actually has more sightings (as that is a variable in our dataset too).
borough_population <-
rat_dataset %>%
ggplot(aes(x = Borough, fill = Borough)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
scale_fill_manual(values=c("#882E72", "#B178A6", "#114477", "#1965B0",
"#5289C7", "#7BAFDE", "#4EB265", "#90C987",
"#CAE0AB", "#F7EE55", "#F6C141", "#F1932D",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA",
"#77AADD", "#117755", "#44AA88", "#99CCBB",
"#777711")) +
ylim(0, 40000) +
labs(x = "Borough", y = "Total Number of Rat Sightings",
title = "Number of Rat Sightings in Each Borough") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
borough_population
It looks like Brooklyn is the borough with the most rat sightings. Even though the map shows that Manhatten, for example, looks more densely populated, Manhatten is also a bigger borough, so the sightings are more evenly distributed throughout it.
How evenly exactly?
data4lat_brooklyn <-
rat_dataset %>%
filter(!is.na(Latitude)) %>%
filter(Borough == "BROOKLYN") %>%
select(Latitude)
max_brooklyn_lat <- max(data4lat_brooklyn)
min_brooklyn_lat <- min(data4lat_brooklyn)
data4lon_brooklyn <-
rat_dataset %>%
filter(!is.na(Longitude)) %>%
filter(Borough == "BROOKLYN") %>%
select(Longitude)
max_brooklyn_lon <- max(data4lon_brooklyn)
min_brooklyn_lon <- min(data4lon_brooklyn)
brooklyn_rats <-
rat_dataset %>%
filter(Borough == "BROOKLYN")
brooklyn_map <-
ggmap(newyork.map) +
geom_point(data = brooklyn_rats, mapping = aes(x = Longitude, y = Latitude), color = "Red", size = 0.1) +
ylim(min_brooklyn_lat, max_brooklyn_lat) +
xlim(min_brooklyn_lon, max_brooklyn_lon) +
ggtitle("Rat Sightings Concentration in Brooklyn")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
brooklyn_map
The rats seem to stand away from the beach and the parks (they probably have work to do, like us). The denser the subway lines get, the denser the rat population too, it seems.
Check how many location types it is
unique(rat_dataset$Location_Type)
## [1] 3+ Family Mixed Use Building Commercial Building
## [3] 1-2 Family Dwelling 3+ Family Apt. Building
## [5] Public Stairs Other (Explain Below)
## [7] Vacant Lot Construction Site
## [9] Hospital Parking Lot/Garage
## [11] Catch Basin/Sewer Vacant Building
## [13] 1-2 Family Mixed Use Building Public Garden
## [15] Government Building Office Building
## [17] School/Pre-School Day Care/Nursery
## [19] Single Room Occupancy (SRO) Summer Camp
## [21] <NA>
## 20 Levels: 1-2 Family Dwelling ... Vacant Lot
# 20 location types
There are 20 location types.
Compute the percentage of rat sighting in each location type
loc_per <- rat_dataset %>%
group_by(Location_Type) %>%
summarise(Percentage = round(100*(n() / nrow(.)), 1)) %>%
arrange(desc(Percentage))
loc_per
## # A tibble: 21 × 2
## Location_Type Percentage
## <fct> <dbl>
## 1 3+ Family Apt. Building 40.3
## 2 1-2 Family Dwelling 19.3
## 3 Other (Explain Below) 14.8
## 4 3+ Family Mixed Use Building 7.8
## 5 Commercial Building 4.9
## 6 Vacant Lot 3.6
## 7 Construction Site 2.2
## 8 Vacant Building 1.8
## 9 1-2 Family Mixed Use Building 1.7
## 10 Catch Basin/Sewer 1.1
## # … with 11 more rows
Plot the bar chart that reveals the occupation of rat sighting in each location type
loc_per %>%
bar_chart(Location_Type, Percentage,
bar_color = c("#882E72", "#B178A6", "#114477", "#1965B0",
"#5289C7", "#7BAFDE", "#4EB265", "#90C987",
"#CAE0AB", "#F7EE55", "#F6C141", "#F1932D",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA",
"#77AADD", "#117755", "#44AA88", "#99CCBB",
"#777711")) +
geom_text(aes(label = paste(Percentage , "%")),
hjust = -0.2) +
ylim(NA, 45) +
theme(plot.title = element_text(hjust=0.5)) +
labs(x = "", y = "", title = "Percentage of Rat Sighting in Different Location Type",
fill = "Location Type") +
theme_minimal()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
Following the implications of the bar chart, the most frequent cases of rat sightings occurred in apartment buildings with 3 or more family members, accounting for 40.3%. This is followed by residences with 1 to 2 family members at 19.3%. Notably, mixed-use buildings with 3 or more family members also accounted for 7.8%. Overall, the types of locations associated with families held the most complaints of rat sightings.
What type of location do these happen in, and how are they distributed on the map?
location_subset1 <- subset(rat_dataset, Location_Type %in% c("1-2 Family Dwelling", "1-2 Family Mixed Use Building", "3+ Family Apt. Building", "3+ Family Mixed Use Building"))
location_map1 <-
ggmap(newyork.map) +
geom_point(data = location_subset1, mapping = aes(x = Longitude, y = Latitude, color = Location_Type), alpha = 0.3) +
facet_wrap(~ Location_Type, nrow = 2) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration in Different Locations")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
location_map1
location_subset2 <- subset(rat_dataset, Location_Type %in% c("Catch Basin/Sewer", "Commercial Building", "Construction Site", "Day Care/Nursery"))
location_map2 <-
ggmap(newyork.map) +
geom_point(data = location_subset2, mapping = aes(x = Longitude, y = Latitude, color = Location_Type), alpha = 0.3) +
facet_wrap(~ Location_Type, nrow = 2) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration in Different Locations")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
location_map2
location_subset3 <- subset(rat_dataset, Location_Type %in% c("Government Building", "Hospital", "Office Building", "Other (Explain Below)"))
location_map3 <-
ggmap(newyork.map) +
geom_point(data = location_subset3, mapping = aes(x = Longitude, y = Latitude, color = Location_Type), alpha = 0.3) +
facet_wrap(~ Location_Type, nrow = 2) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration in Different Locations")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
location_map3
location_subset4 <- subset(rat_dataset, Location_Type %in% c("Parking Lot/Garage", "Public Garden", "Public Stairs", "School/Pre-School"))
location_map4 <-
ggmap(newyork.map) +
geom_point(data = location_subset4, mapping = aes(x = Longitude, y = Latitude, color = Location_Type), alpha = 0.3) +
facet_wrap(~ Location_Type, nrow = 2) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration in Different Locations")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
location_map4
location_subset5 <- subset(rat_dataset, Location_Type %in% c("Single Room Occupancy (SRO)", "Summer Camp", "Vacant Building", "Vacant Lot"))
location_map5 <-
ggmap(newyork.map) +
geom_point(data = location_subset5, mapping = aes(x = Longitude, y = Latitude, color = Location_Type), alpha = 0.3) +
facet_wrap(~ Location_Type, nrow = 2) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration in Different Locations")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
location_map5
Looking at the Family Buildings, it is actually scary how evenly spread out (and still, always concentrated) around the maps. Even in Staten Island, which we already saw had the least concentration, we can still see that most of them were in Family Buildings. Looking at the map for the sightings at Sewers, it is no surprise the biggest concentration is in Manhatten, which is famously known in movies for having rats come out of sewers, as well as walking around stealing Dolar Slices, so you can see that Manhatten is also the one with more concentration in Public Spaces. It is also no surprise most of the sightings at Government Buildings are in Manhatten, as most of the Government buildings in New York are in Downtown Manhatten. There are no rat sightings in Office Buildings in Staten Island, as it is a Borough known to be mostly residential. For the “Other (Explain Below)”, it makes sense for this one to be the most densely populated (as there are many building types not listed as the primary ones) and the one with the sightings more evenly distributed (as there are “other” types of buildings everywhere).
Plot the Percentage of rat sightings by month
# compute the percentage of each month on rat sighting
month_per <- rat_dataset %>%
group_by(Created_Month) %>%
summarise(Percentage = round(100*(n() / nrow(.)),1)) %>%
arrange(desc(Percentage))
month_per
## # A tibble: 12 × 2
## Created_Month Percentage
## <fct> <dbl>
## 1 07 11.8
## 2 08 11.4
## 3 06 11.1
## 4 05 10.6
## 5 09 9.5
## 6 04 8.7
## 7 10 7.8
## 8 03 7.6
## 9 02 5.8
## 10 01 5.6
## 11 11 5.3
## 12 12 4.8
month_per %>%
mutate(Month = Created_Month,
Month = recode_factor(Month, '01' = "January", '02' = "Febuary",
'03' = "March", '04' = "April", '05' = "May", '06' = "June",
'07' = "July", '08' = "August", '09' = "September",
'10' = "October", '11' = "November", '12' = "December")) %>%
bar_chart(Month, Percentage,
bar_color = c("#882E72", "#B178A6", "#114477", "#1965B0",
"#5289C7", "#7BAFDE", "#4EB265", "#90C987",
"#CAE0AB", "#F7EE55", "#F6C141", "#F1932D")) +
geom_text(aes(label = paste(Percentage , "%")),
hjust = -0.2) +
ylim(NA, 15) +
theme(plot.title = element_text(hjust=0.5)) +
labs(x = "", y = "Percentage", title = "Percentage of Rat Sighting in Each Month",
fill = "Month") +
theme_minimal()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
A monthly percentage of rat sightings clearly reveals that rat sightings can be roughly divided by season. Rat sightings are highest in the summer months, followed by spring and fall, and lowest in the winter months.
Plot the number of rat sightings versus year
rat_dataset %>%
ggplot(aes(x = Created_Year, fill = Created_Year)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
scale_fill_manual(values=c("#882E72", "#B178A6", "#5289C7", "#7BAFDE",
"#CAE0AB", "#F7EE55", "#E8601C", "#DC050C")) +
ylim(0, 20000) +
labs(x = "Year", y = "Total Number of Rat Sighting",
title = "Number of Rat Sighting 2010 - 2017", fill = "Year") +
theme_classic() +
theme(plot.title = element_text(hjust=0.5))
Since 2010 to 2013, the number of rat sighting cases did not fluctuate measurably. Yet, the number of rat sightings escalated yearly after 2013, from 2014 to 2016, with the number of rat sightings growing from 10,534 cases in 2010 to 17,230 cases in 2016. Possible reasons behind this are that New York residents tend to report rat sightings to the administration year by year or that the population of New York rises over the period of the years, thus causing the rat population to rise with it.
Another thing that needs to be mentioned here is that data collection was interrupted in September 2017, hence the drop in rat sightings in 2017.
Plot the line chart of rat sightings versus year to more easily observe changes
# Year with rat sighting (count)
year_count <- rat_dataset %>%
group_by(Created_Year) %>%
summarise(Number = n())
year_count
## # A tibble: 8 × 2
## Created_Year Number
## <fct> <int>
## 1 2010 10534
## 2 2011 10454
## 3 2012 10643
## 4 2013 10739
## 5 2014 12617
## 6 2015 15272
## 7 2016 17230
## 8 2017 14425
# Year with rat sighting (count)
year_count %>%
ggplot(aes(x = Created_Year, y = Number, group = 1)) +
geom_point(size = 1.5) +
geom_line(color = "dark blue", size = 1) +
labs(x = "Year", y = "Number of Rat Sighting",
title = "Number of Rat Sighting 2010 - 2017", fill = "Year") +
theme_classic() +
theme(plot.title = element_text(hjust=0.5))
The line graph offers a better experience in observing the trend of rat sightings by year. The line graph reaffirms the above discussion, with an alarming increase in rat sightings from 2013 to 2016.
Mapping the number of rat sightings for each month of each year
# Number of Rat Sighting in each Months of Each Year
year_month_count <- rat_dataset %>%
group_by(Created_Year) %>%
count(Created_Month)
# Number of Rat Sighting in each Months of Each Year
year_month_count %>%
ggplot(aes(x = Created_Month, y = n, color = Created_Year, group = Created_Year)) +
geom_point() +
geom_line() +
scale_color_manual(values=c("#882E72", "black", "#EEA236", "#5CB85C",
"#46B8DA", "#9632B8", "dark green", "#DC050C")) +
labs(x = "Month", y = "Number of Rat Sighting",
title = "Number of Rat Sighting", color = "Year") +
theme_classic() +
theme(plot.title = element_text(hjust=0.5))
The graph reflects that the profile of the number of rat sightings is similar from year to year, with the most frequent rat sightings usually happening in the summer months as opposed to the less frequent rat sightings usually falling in the winter months. On the other hand, the number of rat sightings rose over time, in the case of 2017, even though the data are not yet complete, it remained noticeable that the overall number of rat sightings per month was much higher than in previous years.
Compute the length of time to solve a complaint of a rat sighting
# combine the year, month, and day of started day
starttime <- as.Date(with(rat_dataset, paste(Created_Year, Created_Month,
Created_Day, sep = "-")), "%Y-%m-%d")
# combine the year, month, and day of closed day
endtime <- as.Date(with(rat_dataset, paste(Closed_Year, Closed_Month,
Closed_Day, sep = "-")), "%Y-%m-%d")
# compute for the duration taken to solve
solved_duration <- data.frame(difftime(endtime,starttime,units = "days"))
colnames(solved_duration)[1] <- "duration"
solved_duration$duration <- as.numeric(solved_duration$duration)
# merge year created year with solved time, in order to see if the government's efficient improve
solved_duration <- cbind(rat_dataset$Created_Year, solved_duration)
colnames(solved_duration)[1] <- "Sighting_Year"
# remove the negative number, as it cannot be negative time to solve the issue
solved_duration <- solved_duration %>%
mutate(month_cat = cut(duration, breaks = c(0, 30, 60, 90, 120, 150, 180, 1000),
labels = c("less than 1 month", "less than 2 month",
"less than 3 month", "less than 4 month",
"less than 5 month", "less than half year",
"more than half year")))
# merge it to the original dataset
rat_dataset <- cbind(rat_dataset, solved_duration)
solved_duration <-
solved_duration %>%
drop_na()
Plot the time required to solve the problem
solved_duration %>%
ggplot(aes(x = month_cat, fill = month_cat)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
scale_fill_manual(values=c("#882E72", "#5289C7", "#CAE0AB",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA")) +
ylim(0, 70000) +
labs(x = "Time to Solve", y = "Cases Count",
title = "Time to Solve Rat Sighting", fill = "Month") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1, vjust = 0),
plot.title = element_text(hjust=0.5))
Apparently, the majority of rat sighting complaints were resolved within 1 month, with 64,371 cases, followed by less than 2 months, with 5,599 cases, suggesting that the administration is efficient in tackling rat sighting problems.
Plot the percentage of rat sighting case resolution times by year
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
solved_duration %>%
filter(Sighting_Year == c(2010, 2011, 2012), duration > 0) %>%
ggplot(aes(x= month_cat, group=Sighting_Year)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes(label = scales::percent(round((..prop..),3)),
y= ..prop.. ),
stat= "count", vjust = -0.5, hjust = 0.5, angle = 0,
posistion = position_dodge(width=0.9), size = 3) +
scale_fill_manual(values = c("#882E72", "#5289C7", "#CAE0AB",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA"),
labels = c("less than 1 month", "less than 2 month",
"less than 3 month", "less than 4 month",
"less than 5 month", "less than half year",
"more than half year")) +
ylim(0, 100) +
labs(x = "Time to Solve", y = "Cases Count",
title = "Time to Solve Rat Sighting", fill = "Month") +
facet_grid(~Sighting_Year) +
scale_y_continuous(labels=percent) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1, vjust = 0),
plot.title = element_text(hjust=0.5))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
solved_duration %>%
filter(Sighting_Year == c(2013, 2014, 2015)) %>%
ggplot(aes(x= month_cat, group=Sighting_Year)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes(label = scales::percent(round((..prop..),3)),
y= ..prop.. ),
stat= "count", vjust = -0.5, hjust = 0.5, angle = 0,
posistion = position_dodge(width=0.9), size = 3) +
scale_fill_manual(values = c("#882E72", "#5289C7", "#CAE0AB",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA"),
labels = c("less than 1 month", "less than 2 month",
"less than 3 month", "less than 4 month",
"less than 5 month", "less than half year",
"more than half year")) +
ylim(0, 100) +
labs(x = "Time to Solve", y = "Cases Count",
title = "Time to Solve Rat Sighting", fill = "Month") +
facet_grid(~Sighting_Year) +
scale_y_continuous(labels=percent) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1, vjust = 0),
plot.title = element_text(hjust=0.5))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
solved_duration %>%
filter(Sighting_Year == c(2016, 2017)) %>%
ggplot(aes(x= month_cat, group=Sighting_Year)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes(label = scales::percent(round((..prop..),3)),
y= ..prop.. ),
stat= "count", vjust = -0.5, hjust = 0.5, angle = 0,
posistion = position_dodge(width=0.9), size = 4) +
scale_fill_manual(values = c("#882E72", "#5289C7", "#CAE0AB",
"#E8601C", "#DC050C", "#D6C1DE", "#4477AA"),
labels = c("less than 1 month", "less than 2 month",
"less than 3 month", "less than 4 month",
"less than 5 month", "less than half year",
"more than half year")) +
ylim(0, 100) +
labs(x = "Time to Solve", y = "Cases Count",
title = "Time to Solve Rat Sighting", fill = "Month") +
facet_grid(~Sighting_Year) +
scale_y_continuous(labels=percent) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1, vjust = 0),
plot.title = element_text(hjust=0.5))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
As seen in the graph, the overall efficiency of solving rat sighting cases has improved year by year, from 73.2% of cases resolved within 1 month in 2010 to approximately 80% of cases closed within 1 month in 2011, 2012, and further to 90% of cases settled within 1 month in 2013 to 2017.
Does the location matter when it comes to how long they take to solve a case?
duration_map <-
ggmap(newyork.map) +
geom_point(data = rat_dataset %>% filter_at(vars(Longitude, Latitude, month_cat),all_vars(!is.na(.))), mapping = aes(x = Longitude, y = Latitude, color = month_cat), alpha = 0.3) +
facet_wrap(~ month_cat) +
ylim(min_lat, max_lat) +
xlim(min_lon, max_lon) +
ggtitle("Rat Sightings Concentration for Various Solving Times")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
duration_map
The cases that take longer than 4 months to be closed are mostly localised in Manhatten, which makes sense, because, as you will see later, Mahantten is the borough with the biggest rat/population ratio.
There are clear relations between most of the variables in the
rat_sightings dataset. Particularly, we see that the place
with the biggest rat sighting concentration is Manhatten, also having
the highest rat/person ratio and the longest solving times. However,
there are more overall sightings in Brooklyn. Most sightings happen at
Family Buildings and most of the cases are solved within one month.